Communicate Data Finding
Prosper Loan Data

Table of Contents

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('seaborn-darkgrid')
In [2]:
# set figure size and figure format 

plt.rcParams['figure.figsize'] = [8, 6]
%config InlineBackend.figure_format = 'svg'
In [3]:
# to keep the plot in fixed, so that I don't have to scroll
In [4]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines){
    return false;
}
In [5]:
df = pd.read_csv('data/prosperLoanData.csv')
In [6]:
df.head(6)
Out[6]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
5 0F05359734824199381F61D 1074836 2013-12-14 08:26:37.093000000 NaN 60 Current NaN 0.15425 0.1314 0.1214 ... -25.33 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

6 rows × 81 columns

In [7]:
df.shape
Out[7]:
(113937, 81)

Univariate Plot

What is the structure of your dataset?

There are 113937 observation and 81 variables.

What is/are the main feature(s) of interest in your dataset?

There are so many variables in this dataset. Looking only variables name it is hard to find what is most interesting. After some analysis I found that some features are more interesting than others. Through out the project I will work with Borrower Rate, Borrower APR, Prosper Score, Credit Score, Original Loan Amount, Monthly Payment, Borrower Occupation, Borrower State, Borrower Employment Status and some other feature if it seems interest later on.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will look into how Borrower rate and APR changes over time and what make borrower a defaulter. For that I think Loan Origination Date, Borrower Rate and APR, Prosper Score, Income Range, Employment Status and some others feature will help to investigate my feature of interest.

Occupation

In [8]:
base_color = sns.color_palette()[0]
In [9]:
# Ocupation bar plot
df.Occupation.value_counts().head(20).plot(kind = 'bar', color = base_color);
plt.xlabel('Occupation')
plt.ylabel('Count');
plt.title('Occupation Types Count');

Other is the highest category occupation that borrow money from prosper. others notable category are Professional and Computer Programmer.

Prosper Score

In [10]:
# Prosper Score bar plot
base_color = sns.color_palette()[0]
df['ProsperScore'].value_counts().plot(kind = 'bar', color = base_color);
plt.xlabel('Prosper Score')
plt.ylabel('Count')
plt.title("Prosper Score Count");

A majority of borrower have prosper score either 4,6, or 8.

In [11]:
prosperscore = df['ProsperScore'].dropna()
In [12]:
level_order = ['1.0','2.0', '3.0', '4.0', '5.0', '6.0', '7.0', '8.0', '9.0', '10.0', '11.0']
order_cat = pd.api.types.CategoricalDtype(ordered = True, categories=level_order)
prosperscore = prosperscore.astype(order_cat)
In [13]:
# color = '#099DD9' can be use , good one
# color_palette_list = ['#009ACD', '#ADD8E6', '#63D1F4', '#0EBFE9', '#C1F0F6', '#0099CC']
# prosper score count plot
#base_color = sns.color_palette()[9]
# 'darkturquoise'
sns.countplot(data = df, x = 'ProsperScore', color = base_color);
plt.xticks(rotation = 0)
plt.xlabel('Prosper Score')
# plt.style.use()
plt.ylabel('Count');
In [14]:
# color = '#099DD9' can be use , good one

# prosper score count plot
#base_color = sns.color_palette()[]
sns.countplot(data = df, x = 'ProsperScore', color = '#099DD9' );
plt.xticks(rotation = 0)
plt.xlabel('Prosper Score')
plt.ylabel('Count');
plt.title('Prosper Score Count');

Borrower Rate

In [15]:
df['BorrowerRate'] =df['BorrowerRate'].apply(lambda x: x *100)
In [16]:
# Borrower Rate

ticks =np.arange(0, 50 ,5)
labels = ['{}'.format(v) for v in ticks]
bins = np.arange(0, df['BorrowerRate'].max()+5, 5)
plt.hist(data = df, x = 'BorrowerRate', bins = bins);
plt.xticks(ticks, labels);
plt.xticks(rotation = 90);
plt.xlabel('Borrower Rate')
plt.ylabel('Count')
plt.title("Borrower Rate Count");

most people have borrower rate between 5 to 35%. Mean is 18.4%.

In [17]:
df['BorrowerRate'].describe()
Out[17]:
count    113937.000000
mean         19.276406
std           7.481843
min           0.000000
25%          13.400000
50%          18.400000
75%          25.000000
max          49.750000
Name: BorrowerRate, dtype: float64
In [18]:
df['BorrowerAPR'].describe()
Out[18]:
count    113912.000000
mean          0.218828
std           0.080364
min           0.006530
25%           0.156290
50%           0.209760
75%           0.283810
max           0.512290
Name: BorrowerAPR, dtype: float64

Borrower APR

In [19]:
# Borrower APR
ticks =list(np.arange(0, 0.5, 0.025))
labels = ['{:.2f}'.format(v) for v in ticks]
bins = np.arange(0, df['BorrowerAPR'].max()+.02, .02)
plt.hist(data = df, x = 'BorrowerAPR', bins = bins);
plt.xticks(ticks, labels);
plt.xticks(rotation = 90); 
plt.xlabel('Borrower APR')
plt.ylabel('Count')
plt.title("Borrower APR Count");

mean borrower APR around 0.20. One interesting thing is that lots of people have APR 0.35.

Employment Status

In [20]:
#create new data frame with employment status
employment = df['EmploymentStatus']
In [21]:
#convert series to dataframe
employment = employment.to_frame()
In [22]:
# drop NA values
employment.dropna(inplace= True)
In [23]:
employ_count = employment['EmploymentStatus'].value_counts()
In [24]:
n_employment = employment.shape[0]
max_employ_count = employment['EmploymentStatus'].value_counts()[0]
max_prop = max_employ_count/n_employment
print(max_prop)
0.6028008094410917
In [25]:
np.arange(0, max_prop, 0.05)
Out[25]:
array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 ])
In [26]:
tick_prop = np.arange(0, max_prop, 0.05)
tick_name = ['{:0.2f}'.format(v) for v in tick_prop]
In [27]:
type_order = employ_count.index
In [28]:
sns.countplot(data = employment, y = 'EmploymentStatus', color = base_color,
             order = type_order)
plt.xticks(tick_prop * n_employment, tick_name)
plt.xlabel('Proportion')


for i in range(employ_count.shape[0]):
    count = employ_count[i]
    pct_string = '{:0.1f}%'.format(100 * count/ n_employment)
    plt.text(count+1, i, pct_string, va = 'center')
In [29]:
sns.countplot(data = df, x = 'EmploymentStatus', color = base_color, order=type_order);
plt.xticks(rotation = 15);
plt.xlabel('Employment Status')
plt.ylabel('Count');
plt.title('Employment Status Count');

Most people is either employed or full-time.

Home Owner

In [30]:
# pie chart
plt.figure(figsize=(6,4))
sorted_count = df['IsBorrowerHomeowner'].value_counts()
plt.pie(sorted_count, labels= sorted_count.index, startangle=90, counterclock=False)
plt.axis('square');
plt.title('Is Borrower Home Owner');

half of the people home owner and other half not

In [31]:
df['IsBorrowerHomeowner'].value_counts()
Out[31]:
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64
In [32]:
# is borrower home owner
plt.figure(figsize=(6,4))
sns.countplot( data = df, x = 'IsBorrowerHomeowner');
plt.xlabel("Home Owner")
plt.ylabel("Count")
plt.title("Is Borrower Home Owner");

Income Range

In [33]:
df['IncomeRange'].value_counts()
Out[33]:
$25,000-49,999    32192
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7741
$1-24,999          7274
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64
In [34]:
type_order_income = df['IncomeRange'].value_counts().index
In [35]:
sns.countplot(data = df, x = 'IncomeRange', color = base_color, order  = type_order_income)
plt.xticks(rotation = 90);
plt.xlabel('Income Range')
plt.ylabel('Count')
plt.title('Income Range Count');

Most people have income form 25000 to 75000 dollar.

Lender Yield

In [36]:
# lender yield

ticks =list(np.arange(0, 0.5, 0.025))
labels = ['{:.2f}'.format(v) for v in ticks]
bins = np.arange(0, df['LenderYield'].max()+.02, .02)
plt.hist(data = df, x = 'LenderYield', bins = bins);
plt.xticks(ticks, labels);
plt.xticks(rotation = 90); 
plt.xlabel('Lender Yield')
plt.ylabel('Count')
plt.title('Lender Yield Histogram');

Lender yield is normally distribute and this plot quite similar to borrower APR

Total Inquiries

In [37]:
# total inquery , limit to 20 value
order_type_inq = df['TotalInquiries'].value_counts().head(20).index
In [38]:
# use order type for highest to lowest
plt.figure(figsize=(8,6))
sns.countplot(data =df, x = 'TotalInquiries', order = order_type_inq, color=base_color);
plt.xticks(rotation = 15);
plt.xlabel('Total Inquiries')
plt.ylabel('Count')
plt.title('Total Inquiries Count');

most people have inquiries less than 6. highest inquiries is 2.

In [39]:
# no order type, use x limit to take only first 20 value, 

plt.figure(figsize=(8,6))
sns.countplot(data =df, x = 'TotalInquiries', color= base_color);
plt.xticks(rotation = 20);
plt.xlim((-1,20.5))
plt.xlabel('Total Inquiries')
plt.ylabel('Count')
plt.title('Total Inquiries Count');

Original Loan Amount

In [40]:
ticks =np.arange(0, 36000 ,2000)
labels = ['{}'.format(v) for v in ticks]
bins = np.arange(df['LoanOriginalAmount'].min(), df['LoanOriginalAmount'].max()+1000, 1000)
plt.hist(data = df, x = 'LoanOriginalAmount', bins = bins);
plt.xticks(ticks, labels);
plt.xticks(rotation = 90);
plt.xlabel('Loan Original Amount')
plt.ylabel('Count')
plt.title('Loan Original Amount Count');
In [41]:
# order from highest  count to lowest count
df['LoanOriginalAmount'].value_counts().head(20).plot(kind = 'bar', color = base_color);
plt.xlabel('Loan Original Amount')
plt.ylabel('Count')
plt.title('Loan Original Amount Count');

Notable amount that borrowed are 4000, 15000, and 10000 dollar.

In [42]:
# cut continuous to discrete variable
range_bin = np.arange(1000,df['LoanOriginalAmount'].max() + 1000, 1000)
bins = pd.cut(df['LoanOriginalAmount'], range_bin)
In [43]:
loan_original = pd.value_counts(bins)
In [44]:
# # order from highest  count to lowest count
plt.figure(figsize= (12,8))
loan_original.plot(kind = 'bar', color = base_color);
plt.xlabel('Loan Original Amount')
plt.ylabel('Count')
plt.title('Loan Original Amount Count');
In [45]:
# order from lowest amount to highest amount
loan_original = pd.value_counts(bins).sort_index()
plt.figure(figsize= (12,8))
loan_original.plot(kind = 'bar', color = base_color);
plt.xlabel('Loan Original Amount')
plt.ylabel('Count')
plt.title('Loan Original Amount Count');

Credit Score

In [46]:
# histogram
bins = np.arange(400, 900, 20)
df['CreditScoreRangeLower'].plot(kind = 'hist', bins = bins)
plt.xlim((400, 900))
plt.xlabel('Credit Score Range Lower')
plt.title('Credit Score Range Lower Frequency');

mean credit score was 680.

In [47]:
# Credit Score Range Upper histogram
bins = np.arange(400, 900, 20)
df['CreditScoreRangeUpper'].plot(kind = 'hist', bins = bins)
plt.xlim((400, 1000))
plt.xlabel('Credit Score Range Upper')
plt.title('Credit Score Range Upper Frequency');

mean credit score was 700.

In [48]:
df[['CreditScoreRangeLower','CreditScoreRangeUpper', 'FirstRecordedCreditLine',

'CurrentCreditLines', 'RevolvingCreditBalance',  'BankcardUtilization',

'StatedMonthlyIncome', 'TotalProsperLoans', 'OnTimeProsperPayments',

'LoanOriginationDate', 'LoanOriginationQuarter', 'MonthlyLoanPayment',

'InvestmentFromFriendsCount', 'Investors']].describe()
Out[48]:
CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines RevolvingCreditBalance BankcardUtilization StatedMonthlyIncome TotalProsperLoans OnTimeProsperPayments MonthlyLoanPayment InvestmentFromFriendsCount Investors
count 113346.000000 113346.000000 106333.000000 1.063330e+05 106333.000000 1.139370e+05 22085.000000 22085.000000 113937.000000 113937.000000 113937.000000
mean 685.567731 704.567731 10.317192 1.759871e+04 0.561309 5.608026e+03 1.421100 22.271949 272.475783 0.023460 80.475228
std 66.458275 66.458275 5.457866 3.293640e+04 0.317918 7.478497e+03 0.764042 18.830425 192.697812 0.232412 103.239020
min 0.000000 19.000000 0.000000 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 1.000000
25% 660.000000 679.000000 7.000000 3.121000e+03 0.310000 3.200333e+03 1.000000 9.000000 131.620000 0.000000 2.000000
50% 680.000000 699.000000 10.000000 8.549000e+03 0.600000 4.666667e+03 1.000000 15.000000 217.740000 0.000000 44.000000
75% 720.000000 739.000000 13.000000 1.952100e+04 0.840000 6.825000e+03 2.000000 32.000000 371.580000 0.000000 115.000000
max 880.000000 899.000000 59.000000 1.435667e+06 5.950000 1.750003e+06 8.000000 141.000000 2251.510000 33.000000 1189.000000

Monthly Payment

In [49]:
# Monthly payment histogram

bins = np.arange(0, 1300, 50)
df['MonthlyLoanPayment'].plot(kind = 'hist', bins = bins)
plt.xlim((0,1200))
plt.xlabel('Monthly Payment Amount')
plt.title('Monthly Payment Frequency');

Most people have monthly payment less than 400. But some people have monthly payment as high as 1000 dollar.

Investor

In [50]:
# Investor count, highest 20 shown
df['Investors'].value_counts().head(20).plot(kind = 'bar', color = base_color);
plt.xlabel('Number of Investor')
plt.ylabel('Count')
plt.title('Number of Investor Count');

number of investor only one is most notable.

Borrower State

In [51]:
# Borrower State count
plt.figure(figsize=(12,8))
df['BorrowerState'].value_counts().plot(kind = 'bar', color = base_color);
plt.xlabel('Borrower State')
plt.ylabel('Count')
plt.title('Borrower State Count');

number of borrower count in California is highest followed by Texas, New York and Florida.

Debt to Income Ratio

In [52]:
# cut continuous into discrete value
ratio_bin = np.arange(0,2,0.1)
bins = pd.cut(df['DebtToIncomeRatio'], ratio_bin)
debt_ratio = pd.value_counts(bins)
In [53]:
plt.figure(figsize= (12,8))
debt_ratio.plot(kind = 'bar', color = base_color);
plt.xlabel('Debt To Income Ratio')
plt.ylabel('Count')
plt.title('Debt To Income Ratio Count');

most debt to income ratio in between 0.1 to 0.2.

In [54]:
df['DebtToIncomeRatio'].describe()
Out[54]:
count    105383.000000
mean          0.275947
std           0.551759
min           0.000000
25%           0.140000
50%           0.220000
75%           0.320000
max          10.010000
Name: DebtToIncomeRatio, dtype: float64

Revolving Credit Balance

In [55]:
# cut continuous into discrete value
revol_bin = np.arange(0,df['RevolvingCreditBalance'].max(),5000)
bins = pd.cut(df['RevolvingCreditBalance'], revol_bin)
revol_balance = pd.value_counts(bins).head(20).sort_index()

plt.figure(figsize= (12,8))
revol_balance.plot(kind = 'bar', color = base_color);
plt.xlabel('Revolving Credit Balance')
plt.ylabel('Count')
plt.title('Revolving Credit Balance Count');
In [56]:
df['RevolvingCreditBalance'].describe()
Out[56]:
count    1.063330e+05
mean     1.759871e+04
std      3.293640e+04
min      0.000000e+00
25%      3.121000e+03
50%      8.549000e+03
75%      1.952100e+04
max      1.435667e+06
Name: RevolvingCreditBalance, dtype: float64

Loan Status

In [57]:
df['LoanStatus'].value_counts().plot(kind = 'bar', color = base_color)
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.title('Loan Status Count');

most people are in current borrower followed by who completed.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

when I plotted monthly payment in histogram I observed that it right skewed. I did not perform any transformation because at this point I was just investigating the distribution. Although I observed that number of investor in most cases only one. California is the highest Borrower state, that make sense because population is higher.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When I was investigating loan original amount I found that rather than making histogram it is more appropriate to plot as bar plot. So I make transform the feature into amount range.

Bivariate Plot

Loan Origianation Date vs Original Amount

In [58]:
# since main one contain lots of variable , make new data frame from main one 
df_bi = df.loc[ :, ('LoanStatus',
                    'BorrowerAPR', 
                    'BorrowerRate',
                    'LenderYield',
                    'ProsperScore',
                    'BorrowerState',
                    'Occupation', 
                    'EmploymentStatus',
                    'IsBorrowerHomeowner',
                    'CreditScoreRangeLower',
                    'CreditScoreRangeUpper',
                    'RevolvingCreditBalance', 
                    'BankcardUtilization','DebtToIncomeRatio', 
                    'IncomeRange','LoanMonthsSinceOrigination',
                    'LoanOriginalAmount',
                    'LoanOriginationDate',
                    'MonthlyLoanPayment',
                    'LoanOriginationQuarter',
                    'Investors'
                   )]
In [59]:
# create subset dataframe "loan_origin" for time series plot

loan_origin = df_bi.loc[:, ('LoanOriginationDate','LoanOriginalAmount')]
In [60]:
loan_origin.isnull().sum()
Out[60]:
LoanOriginationDate    0
LoanOriginalAmount     0
dtype: int64
In [61]:
# conver to date time
loan_origin['LoanOriginationDate'] = pd.to_datetime(loan_origin['LoanOriginationDate'])
In [62]:
# make date time index column
loan_origin.set_index('LoanOriginationDate', inplace=True)
In [63]:
# to avoid over plotting, will resample form original,
# look at monthly data
monthly = loan_origin.resample('M').sum()
In [64]:
monthly.plot();
plt.xlabel('Loan Origination Month')
plt.ylabel('Loan Original Amount')
plt.title('Loan Origination Month vs Loan Original Amount');

Loan amount increased over time except in 2009. Decreased in 2009 is may be for financial crisis.

In [65]:
# plot daily data
daily = loan_origin.resample('D').sum()
daily.plot()
plt.xlabel('Loan Origination Date')
plt.ylabel('Loan Original Amount')
plt.title('Loan Origination date vs Loan Original Amount');
In [66]:
# using moving average form smoother plot
daily.rolling(50, center=True, win_type='gaussian').sum(std = 10).plot()
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23f82ef0>

Make new variable LoanOriginationYear

In [67]:
# keep only year value
df_bi['LoanOriginationYear'] = pd.to_datetime(df_bi['LoanOriginationDate']).dt.year
In [68]:
df_bi.head()
Out[68]:
LoanStatus BorrowerAPR BorrowerRate LenderYield ProsperScore BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower ... BankcardUtilization DebtToIncomeRatio IncomeRange LoanMonthsSinceOrigination LoanOriginalAmount LoanOriginationDate MonthlyLoanPayment LoanOriginationQuarter Investors LoanOriginationYear
0 Completed 0.16516 15.80 0.1380 NaN CO Other Self-employed True 640.0 ... 0.00 0.17 $25,000-49,999 78 9425 2007-09-12 00:00:00 330.43 Q3 2007 258 2007
1 Current 0.12016 9.20 0.0820 7.0 CO Professional Employed False 680.0 ... 0.21 0.18 $50,000-74,999 0 10000 2014-03-03 00:00:00 318.93 Q1 2014 1 2014
2 Completed 0.28269 27.50 0.2400 NaN GA Other Not available False 480.0 ... NaN 0.06 Not displayed 86 3001 2007-01-17 00:00:00 123.32 Q1 2007 41 2007
3 Current 0.12528 9.74 0.0874 9.0 GA Skilled Labor Employed True 800.0 ... 0.04 0.15 $25,000-49,999 16 10000 2012-11-01 00:00:00 321.45 Q4 2012 158 2012
4 Current 0.24614 20.85 0.1985 4.0 MN Executive Employed True 680.0 ... 0.81 0.26 $100,000+ 6 15000 2013-09-20 00:00:00 563.97 Q3 2013 20 2013

5 rows × 22 columns

Loan Origination Year vs Borrower Rate

In [69]:
# boxplot
sns.boxplot(data = df_bi, x = 'LoanOriginationYear',
            y = 'BorrowerRate', color = base_color)
plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year vs Borrower Rate');

borrower rate increased from 2008 to 2011 and then decreased after that.

In [70]:
# same as above, only violin plot used
sns.violinplot(data = df_bi, x = 'LoanOriginationYear',
               y = 'BorrowerRate', color = base_color);
plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year vs Borrower Rate');
/Users/YOUSUF/anaconda3/envs/analyst/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [71]:
# same as above, only different type plot
sns.pointplot(data = df_bi, x = 'LoanOriginationYear',
              y = 'BorrowerRate', color = base_color)
plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year vs Borrower Rate');

Loan Origination Year vs Borrower APR

In [72]:
sns.boxplot(data = df_bi, x = 'LoanOriginationYear', 
            y = 'BorrowerAPR', color = base_color);
plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower APR')
plt.title('Loan Origination Year vs Borrower APR');

borrower APR increased from 2008 to 2011 and then decreased after that.

Loan Origination Year vs Prosper Score

In [73]:
sns.boxplot(data = df_bi, x = 'LoanOriginationYear',
            y = 'ProsperScore', color = base_color);
plt.xlabel('Loan Origination Year')
plt.ylabel('Prosper Score')
plt.title('Loan Origination Year vs Prosper Score');

Mean Prosper score changes over the time.

Loan Original Amount vs Employment Status

In [74]:
sns.boxplot(data = df_bi, y = 'LoanOriginalAmount', 
            x = 'EmploymentStatus', 
            color = base_color);

plt.xlabel('Employment Status')
plt.ylabel('Loan Original Amount')
plt.title('Loan Original Amount vs Employment Status');
plt.xticks(rotation = 15);

Mean loan amount was highest in employed group.

Employment Status vs Prosper Score

In [75]:
sns.boxplot(data = df_bi, y = 'ProsperScore', 
            x = 'EmploymentStatus', 
            color = base_color);

plt.xlabel('Employment Status')
plt.ylabel('Prosper Score')
plt.title('Employment Status vs Prosper Score');
plt.xticks(rotation = 15);

Mean Prosper score was highest in full-time employment status category.

Employent Status vs Borrower APR

In [76]:
sns.boxplot(data = df_bi, y = 'BorrowerAPR', 
            x = 'EmploymentStatus', color = base_color);

plt.xlabel('Employment Status')
plt.ylabel('Borrower APR')
plt.title('Employent Status vs Borrower APR');
plt.xticks(rotation = 15);

Borrower APR was highest in not employed group.

In [77]:
df_bi.groupby('EmploymentStatus')['BorrowerAPR'].mean()
Out[77]:
EmploymentStatus
Employed         0.223558
Full-time        0.204604
Not available    0.200607
Not employed     0.275757
Other            0.247302
Part-time        0.202211
Retired          0.215450
Self-employed    0.230295
Name: BorrowerAPR, dtype: float64

Employment Status vs Borrower Rate

In [78]:
sns.boxplot(data = df_bi, y = 'BorrowerRate', 
            x = 'EmploymentStatus', 
            color = base_color);

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employent Status vs Borrower Rate');
plt.xticks(rotation = 15);

Borrower rate was highest in not employed group.

In [79]:
df_bi.groupby('EmploymentStatus')['BorrowerRate'].mean()
Out[79]:
EmploymentStatus
Employed         19.279058
Full-time        18.700599
Not available    19.149248
Not employed     24.407880
Other            21.369622
Part-time        18.440028
Retired          19.444201
Self-employed    20.226863
Name: BorrowerRate, dtype: float64

Income Range vs Borrower APR

In [80]:
sns.boxplot(data = df_bi, y = 'BorrowerAPR', 
            x = 'IncomeRange', 
            color = base_color);

plt.xlabel('Income Range')
plt.ylabel('Borrower APR')
plt.title('Income Range vs Borrower APR');
plt.xticks(rotation = 15);

Borrower APR was highest for not employed, and lowest for who earn more than 100k.

In [81]:
df_bi.groupby('IncomeRange')['BorrowerAPR'].mean()
Out[81]:
IncomeRange
$0                0.210347
$1-24,999         0.247970
$100,000+         0.195626
$25,000-49,999    0.234936
$50,000-74,999    0.217744
$75,000-99,999    0.208070
Not displayed     0.198602
Not employed      0.279206
Name: BorrowerAPR, dtype: float64

Income Range vs Borrower Rate

In [82]:
sns.boxplot(data = df_bi, y = 'BorrowerRate', 
            x = 'IncomeRange', 
            color = base_color);

plt.xlabel('Income Range')
plt.ylabel('Borrower Rate')
plt.title('Income Range vs Borrower Rate');
plt.xticks(rotation = 15);

Borrower Rate was highest for not employed, and lowest for who earn more than 100k.

create new variable "Status"

In [83]:
# those are current, completed and cancelled are categorize as completed otherwise defaulted
df_bi['Status'] = np.where((df_bi['LoanStatus'] == 'Current') | 
                         (df_bi['LoanStatus'] == 'Completed')| 
                         (df_bi['LoanStatus'] == 'Cancelled'), 
                         'Completed', 'Defaulted')
In [84]:
df_bi.head()
Out[84]:
LoanStatus BorrowerAPR BorrowerRate LenderYield ProsperScore BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower ... DebtToIncomeRatio IncomeRange LoanMonthsSinceOrigination LoanOriginalAmount LoanOriginationDate MonthlyLoanPayment LoanOriginationQuarter Investors LoanOriginationYear Status
0 Completed 0.16516 15.80 0.1380 NaN CO Other Self-employed True 640.0 ... 0.17 $25,000-49,999 78 9425 2007-09-12 00:00:00 330.43 Q3 2007 258 2007 Completed
1 Current 0.12016 9.20 0.0820 7.0 CO Professional Employed False 680.0 ... 0.18 $50,000-74,999 0 10000 2014-03-03 00:00:00 318.93 Q1 2014 1 2014 Completed
2 Completed 0.28269 27.50 0.2400 NaN GA Other Not available False 480.0 ... 0.06 Not displayed 86 3001 2007-01-17 00:00:00 123.32 Q1 2007 41 2007 Completed
3 Current 0.12528 9.74 0.0874 9.0 GA Skilled Labor Employed True 800.0 ... 0.15 $25,000-49,999 16 10000 2012-11-01 00:00:00 321.45 Q4 2012 158 2012 Completed
4 Current 0.24614 20.85 0.1985 4.0 MN Executive Employed True 680.0 ... 0.26 $100,000+ 6 15000 2013-09-20 00:00:00 563.97 Q3 2013 20 2013 Completed

5 rows × 23 columns

In [85]:
# bar diagram to count status
sns.countplot(data = df_bi, x = 'Status');
plt.xlabel('Status')
plt.ylabel('Count')
plt.title('Status Count');

Status vs Occupation

In [86]:
order_type = df_bi['Occupation'].value_counts().index
In [87]:
plt.figure(figsize=(8,12))
sns.countplot(data = df_bi, y = 'Occupation', hue = 'Status', order = order_type);
plt.xlabel('Count')
plt.ylabel('Occupation')
plt.title('Status vs Occupataion');
In [88]:
order_type = df_bi['Occupation'].value_counts().index[2:20]
plt.figure(figsize=(8,12))
sns.countplot(data = df_bi, y = 'Occupation', hue = 'Status', order = order_type);
plt.xlabel('Count')
plt.ylabel('Occupation')
plt.title('Status vs Occupataion');

hard to tell from first plot if there any group that defaulted most. from second plot, it seems Clerical group defaulted rate is higher than any other group.

Status vs Employment

In [89]:
sns.countplot(data = df_bi, x = 'EmploymentStatus', hue = 'Status');

plt.xlabel('Employment Status')
plt.ylabel('Count')
plt.title('Status vs Employment Status');
plt.xticks(rotation = 15);

full time group defaulted notably.

Status vs Income Range

In [90]:
sns.countplot(data = df_bi, x = 'IncomeRange', hue = 'Status');

plt.xlabel('Income Range')
plt.ylabel('Count')
plt.title('Status vs Income Range');
plt.xticks(rotation = 15);

Proportion of defaulter higher for not displayed group followed by 1-24999 dollar group.

Status vs Borrower State

In [91]:
order_type_state = df_bi['BorrowerState'].value_counts().index
In [92]:
plt.figure(figsize=(8,12))
sns.countplot(data = df_bi, y = 'BorrowerState', hue = 'Status', order = order_type_state);

plt.xlabel('Count')
plt.ylabel('Borrower State')
plt.title('Status vs Borrower State')
plt.legend(loc = 1);

Hard to distinguish if there any state that defaulted most.

Status vs Prosper Score

In [93]:
sns.countplot(data = df_bi, x = 'ProsperScore', hue = 'Status');

plt.xlabel('Prosper Score')
plt.ylabel('Count')
plt.title('Status vs Prosper Score');
In [94]:
#plt.figure(figsize=(8,12))
order_type = df_bi['ProsperScore'].value_counts().index
sns.countplot(data = df_bi, x = 'ProsperScore', hue = 'Status', order = order_type);

plt.xlabel('Prosper Score')
plt.ylabel('Count')
plt.title('Status vs Prosper Score');

score 1 is quite alarming. They have around 50 % chance being defaulted.

Status vs Borrower Rate

In [95]:
# masking 
completed = df_bi['Status'] == 'Completed'
defaulted = df_bi['Status'] == 'Defaulted'
In [96]:
df['BorrowerRate'][completed].hist(label = 'Completed', alpha = 0.5, bins = 30)
df['BorrowerRate'][defaulted].hist(label = 'Defaulted', alpha = 0.5, bins = 30)

plt.xlabel('Borrower Rate')
plt.ylabel('Count')
plt.title('Status vs Borrower Rate');
plt.legend();

when borrower rate gets higher chance of being defaulted gets higher.

Status vs Borrower APR

In [97]:
completed = df_bi['Status'] == 'Completed'
defaulted = df_bi['Status'] == 'Defaulted'
df['BorrowerAPR'][completed].hist(label = 'Completed', alpha = 0.5, bins = 30)
df['BorrowerAPR'][defaulted].hist(label = 'Defaulted', alpha = 0.5, bins = 30)

plt.xlabel('Borrower APR')
plt.ylabel('Frequency')
plt.title('Status vs Borrower APR');
plt.legend();

in general, as APR increase number of defaulter increases.

Status vs Credit Score

In [98]:
completed = df_bi['Status'] == 'Completed'
defaulted = df_bi['Status'] == 'Defaulted'
bins = np.arange(400, 900, 20)
df['CreditScoreRangeLower'][completed].plot(kind = 'hist', label= 'Completed', bins = bins, alpha = 0.5)
df['CreditScoreRangeLower'][defaulted].plot(kind = 'hist', label= 'Defaulted' ,bins = bins,alpha = 0.5)

plt.xlabel('Credit Score Range Lower')
plt.ylabel('Frequency')
plt.title('Status vs Credit Score Range Lower');
plt.legend()
plt.xlim((400, 900));
In [99]:
completed = df_bi['Status'] == 'Completed'
defaulted = df_bi['Status'] == 'Defaulted'
bins = np.arange(400, 900, 20)
df['CreditScoreRangeUpper'][completed].plot(kind = 'hist', label= 'Completed', bins = bins, alpha = 0.5)
df['CreditScoreRangeUpper'][defaulted].plot(kind = 'hist', label= 'Defaulted' ,bins = bins,alpha = 0.5)

plt.xlabel('Credit Score Range Upper')
plt.ylabel('Frequency')
plt.title('Status vs Credit Score Range Upper');
plt.legend()
plt.xlim((400, 900));

with credit score less than 600 get defaulted highest.

In [100]:
df_bi.shape[0]
Out[100]:
113937

Data Subset

In [101]:
# set seed
np.random.seed(2018)
In [102]:
# sample dataframe
sample = np.random.choice(df_bi.shape[0], 10000, replace= False)
df_subset = df_bi.loc[sample]
In [103]:
df_subset.head()
Out[103]:
LoanStatus BorrowerAPR BorrowerRate LenderYield ProsperScore BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower ... DebtToIncomeRatio IncomeRange LoanMonthsSinceOrigination LoanOriginalAmount LoanOriginationDate MonthlyLoanPayment LoanOriginationQuarter Investors LoanOriginationYear Status
3450 Completed 0.20735 20.00 0.1900 NaN NaN Professional Full-time False 620.0 ... 0.24 $25,000-49,999 82 15000 2007-05-16 00:00:00 557.45 Q2 2007 241 2007 Completed
103816 Completed 0.29537 25.66 0.2466 3.0 PA Nurse (RN) Employed False 660.0 ... 0.17 $50,000-74,999 9 4000 2013-06-19 00:00:00 160.44 Q2 2013 1 2013 Completed
70633 Current 0.14714 11.89 0.1089 6.0 MA Other Employed False 740.0 ... 0.34 $25,000-49,999 3 12000 2013-12-06 00:00:00 397.94 Q4 2013 1 2013 Completed
55747 Current 0.20053 16.39 0.1539 6.0 OR Other Other True 720.0 ... 0.51 $25,000-49,999 7 10000 2013-08-29 00:00:00 353.50 Q3 2013 1 2013 Completed
67600 Chargedoff 0.25259 21.48 0.2048 6.0 FL Sales - Retail Employed True 740.0 ... 0.32 $25,000-49,999 22 10000 2012-05-17 00:00:00 379.22 Q2 2012 78 2012 Defaulted

5 rows × 23 columns

Prosper Score vs Borrower Rate

In [104]:
plt.figure(figsize=(8,6))
sns.regplot(data = df_subset, x = 'ProsperScore', y = 'BorrowerRate', 
            x_jitter= 0.3,scatter_kws={'alpha':1/50});

plt.xlabel('Prosper Score')
plt.ylabel('Borrower Rate')
plt.title('Prosper Score vs Borrower Rate');
/Users/YOUSUF/anaconda3/envs/analyst/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [105]:
# quite same as above, only different type plot used
plt.figure(figsize=(8,6))
sns.stripplot(data = df_subset, x = 'ProsperScore', y = 'BorrowerRate',jitter = .2);

plt.xlabel('Prosper Score')
plt.ylabel('Borrower Rate')
plt.title('Prosper Score vs Borrower Rate');

Borrower rate is negatively correlated with Prosper Score.

Skip following plot.

In [106]:
plt.figure(figsize=(8,6))
sns.catplot(x="EmploymentStatus", y="BorrowerRate", data=df_subset, kind='boxen');

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employment Status vs Borrower Rate');
plt.xticks(rotation = 15);
<Figure size 576x432 with 0 Axes>

Correlation Among Numeric Variables

In [107]:
heat = ['BorrowerAPR','BorrowerRate','LenderYield', 'ProsperScore', 'CreditScoreRangeLower',
       'CreditScoreRangeUpper', 'RevolvingCreditBalance','BankcardUtilization', 'DebtToIncomeRatio', 
       'LoanOriginalAmount']
In [108]:
sns.heatmap(df_subset[heat].corr());
In [109]:
sns.heatmap(df_subset[heat].corr(), cmap = 'rocket_r', annot = True, fmt = '.2f');

Borrower rate and Borrower APR is negatively correlated with Prosper Score.

Home owner vs Status

In [110]:
# is borrower home owner
plt.figure(figsize=(6,4))
sns.countplot( data = df_bi, x = 'IsBorrowerHomeowner', hue = 'Status');
plt.xlabel("Home Owner")
plt.ylabel("Count")
plt.title("Is Borrower Home Owner");

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Original loan amount was lowest just after 2008 financial crisis, after that the amount is increasing over the year. From 2013 to 2014 , amount gets really big. Borrower Rate and APR increased until 2011 then decreasing. Higher Income people gets lower borrower rate and APR.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It is hard to tell whether one will be defaulted knowing his/her occupation, State and whether home owner or not.

Multivariate Plot

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

In [111]:
#bins = np.arange(0, 35000, 1000)
#g = sns.FacetGrid(data = df_subset, col = 'EmploymentStatus', col_wrap= 3);
#g.map(plt.hist, 'LoanOriginalAmount', bins = bins)
In [112]:
# %%script false

# skip this, only for future reference for myself
#g = sns.FacetGrid(df_subset, col="EmploymentStatus",col_wrap = 3)
#g.map(sns.boxplot, "LoanOriginationYear", "CreditScoreRangeUpper");
In [113]:
#%%script false
#g = sns.FacetGrid(df_subset, col="IncomeRange",col_wrap = 3, hue = 'Status' )
#g.map(sns.scatterplot, "BorrowerRate", "ProsperScore", alpha = 0.3);

Employment Status and Borrower Rate vs Status

In [114]:
sns.boxplot(data = df_bi, y = 'BorrowerRate', 
            x = 'EmploymentStatus', hue = "Status");

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employment Status and Borrower Rate vs Status');
plt.xticks(rotation = 15);
In [115]:
# same as above only different type plot
sns.pointplot(data = df_bi, y = 'BorrowerRate', 
            x = 'EmploymentStatus', hue = "Status");

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employment Status and Borrower Rate vs Status');
plt.xticks(rotation = 15);
In [116]:
# same as above only different type plot
sns.pointplot(data = df_bi, y = 'BorrowerRate', 
            x = 'EmploymentStatus', hue = "Status", dodge = True, linestyles = "", ci = 'sd');

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employment Status and Borrower Rate vs Status');
plt.xticks(rotation = 15);
In [117]:
# same as above only different type plot
sns.barplot(data = df_bi, y = 'BorrowerRate', 
            x = 'EmploymentStatus', hue = "Status", ci = 'sd');

plt.xlabel('Employment Status')
plt.ylabel('Borrower Rate')
plt.title('Employment Status and Borrower Rate vs Status');
plt.xticks(rotation = 15);

Borrower rate is higher for defaulter in each group.

Borrower APR and Employment Status vs Status

In [118]:
sns.pointplot(data = df_bi, y = 'BorrowerAPR', 
            x = 'EmploymentStatus', hue = "Status", dodge = True, linestyles = "", ci = 'sd');

plt.xlabel('Employment Status')
plt.ylabel('Borrower APR')
plt.title('Employment Status and Borrower APR vs Status');
plt.xticks(rotation = 15);

Borrower APR is higher for defaulter in each group.

Loan Origination Year and Borrower Rate vs Status

In [119]:
sns.boxplot(data = df_bi, x = 'LoanOriginationYear', 
            y = 'BorrowerRate', hue = "Status", dodge = "True");

plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year and Borrower Rate vs Status');
In [120]:
sns.pointplot(data = df_bi, x = 'LoanOriginationYear', 
            y = 'BorrowerRate', hue = "Status");

plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year and Borrower Rate vs Status');

Borrower rate is higher for defaulter in each Year.

In [121]:
# just differernt type plot
sns.scatterplot(data = df_subset, x = 'LoanOriginationYear', y = 'BorrowerRate', 
                hue = "Status",alpha = 0.3);

plt.xlabel('Loan Origination Year')
plt.ylabel('Borrower Rate')
plt.title('Loan Origination Year and Borrower Rate vs Status');

Credit Score Range Upper and Borrower Rate vs Status

In [122]:
sns.scatterplot(data = df_subset, x = 'CreditScoreRangeUpper', y = 'BorrowerRate', 
                hue = "Status",alpha = 0.3);

plt.xlabel('Credit Score Range Upper')
plt.ylabel('Borrower Rate')
plt.title('Credit Score Range Upper and Borrower Rate vs Status');
plt.xlim((400, 900));

when credit score is lower, they get higher borrower rate and defaulted most.

END

In [123]:
df_subset.dropna(inplace = True)
In [124]:
# skip

#plt.hist2d(data = df_subset, x = 'ProsperScore', y = 'BorrowerRate', cmin =0.5, cmap = 'viridis_r')

#plt.colorbar()

#plt.xlabel('Prosper score')
#plt.ylabel('borrower rate');

This is only for myself. Skip this
Univariate Summary:

bar plot, count plot

  • x = category variable ( count)
  • x = discrete numeric variable

(pie chart can be used in bar chart)

histogram

  • x = continuous numeric variable

Bivariate:

boxplot, violin plot, point plot, catplot

  • x = discrete numeric variable, y = numeric variable(continuous or discrete)
  • x = category variable, y = numeric variable (continuous or discrete)

clustered bar plot (countplot, bar plot)

  • x = category or discrete numeric, y = count, hue= category

overlap histogram

  • x = continuous numeric variable, hue = category

regplot , striplot

  • x = numeric (discrete or continuous) , y = numeric

scatter plot

  • numeric, numeric

heatmap

  • correlation among numeric variables

multivaraite:

clustered box plot, violin plot,point plot, catplot

  • x = discrete numeric variable, y = numeric variable(cont or discrete), hue = category
  • x = category variable, y = numeric variable (cont or discrete), hue = category

clustered bar plot (countplot, bar plot)

  • x = category or discrete numeric, y = numeric , hue= category

scatter plot

  • x = numeric, y= numeric, hue = category